Is the Information Entropy the Same as the Statistical Mechanical Entropy? 
It is shown that the standard expression for the information entropy, originally due to Shannon, 
is only valid for a particular set of states. For the general case of statistical mechanics, one needs 
to include an additional term in the expression for the entropy as a function of the probability. A 
simple derivation of the general formula is given. 



Introduction 



For a system that can exist in states $i = 1, 2, \ldots, n$
with probability distribution $p = \{p_1, p_2, \ldots, p_n\}$ that is
normalised, $\sum_{i=1}^{n} p_i = 1$, the information entropy is

$$S[p] = -k_B \sum_{i=1}^{n} p_i \ln p_i. \tag{1}$$



Boltzmann's constant, $k_B$, is often suppressed in informatic
applications, and the base of the logarithm is often
taken to be 2. The information entropy is a measure of
the disorder or lack of predictability of the system.
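
As a minimal sketch of Eq. (1), the function below evaluates the information entropy of a discrete distribution. The function name, the `k_B` and `base` parameters, and the convention of skipping zero-probability states are assumptions made for illustration; they are not taken from the text.

```python
import math

def information_entropy(p, k_B=1.0, base=math.e):
    """Return -k_B * sum_i p_i log(p_i) for a normalised distribution p.

    States with p_i = 0 are skipped (the usual 0 * log 0 = 0 convention).
    """
    if abs(sum(p) - 1.0) > 1e-12:
        raise ValueError("probabilities must be normalised")
    return -k_B * sum(p_i * math.log(p_i, base) for p_i in p if p_i > 0.0)

uniform = [0.25] * 4
print(information_entropy(uniform))          # natural logarithm, k_B = 1
print(information_entropy(uniform, base=2))  # base 2, i.e. in bits
```

Setting `k_B=1.0` and `base=2` corresponds to the informatic convention mentioned above; the physical convention keeps the natural logarithm and Boltzmann's constant.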

This formula for the entropy was originally introduced
by Shannon as a way of quantifying the information content
of messages (Ref. 1). (Actually the same formula was also
given by Gibbs, who called it the average of the index
of probability. It also coincides with Boltzmann's
H-function.) The formula remains standard in information
theory. It is also widely used to calculate the entropy in
statistical mechanics. Shannon's implicit assumption,
that this information entropy is the same as the entropy
that is used in thermodynamics and in statistical
mechanics, has generally been accepted unquestioned,
and the formula has been used without modification in
the physical sciences. Jaynes based his maximum entropy
formulation of statistical mechanics (wherein the
appropriate probability distribution is obtained by
maximising the information entropy subject to certain
constraints) on Eq. (1), and went on to use it as the basis
of a novel interpretation of logic and probability that has
widespread ramifications (Refs 5, 6). It is also worth mentioning
that statistical mechanical expansions have been used to
develop efficient methods for evaluating the information
entropy, Eq. (1).

This note examines in detail Shannon's derivation
of Eq. (1), which remains the standard text-book
derivation (see, e.g., Ref. 6). In Sec. 2 it is shown that the formula
neglects the internal entropy of the states, due to a certain
confusion between the total entropy and the entropy as a
functional of the macrostate probability. In Sec. 3 a simple
derivation of the full result is reproduced, and in
Sec. 4 it is discussed why this additional term is essential
to obtain the correct physical results and agreement
with the thermodynamic and the statistical mechanical
entropy.



2. Derivation of the Information Entropy 

Following Shannon's original derivation (see Appendix
2 of Ref. 1, or Refs 6, 10), and changing only the notation
and adding a few clarifying remarks, the derivation of the
formula for the information entropy is based on several
self-evident axioms: that it is a continuous function of
the $p_i$, that it increases with the number of states in the
uniform case, and that different ways of grouping the
states must give the same value.

To understand this last axiom define the microstates
as a complete set of distinct, disjoint, and indivisible
states, and a macrostate as a set of microstates. Let
$i$ label the microstates used above, with probabilities
$p_i$, and let $I$ label the macrostates. The macrostate $I$
contains $n_I = \sum_{i \in I} 1$ microstates, and its probability is
$q_I = \sum_{i \in I} p_i$. The conditional probability of a microstate
in a given macrostate is $p_{i|I} = p_i/q_I$, $i \in I$, which can
be written as the $n_I$-component vector $\{p_i/q_I\}$. In view
of this definition, the total entropy can be written as the
uncertainty due to the macrostate probability distribution,
plus the additional uncertainty due to locating the
microstate within each macrostate,

$$S[p] = S[q] + \sum_I q_I S[\{p_i/q_I\}]. \tag{2}$$



The last axiom says that no matter how the microstates 
are grouped into macrostates, the total entropy must re- 
main unchanged. 
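
The grouping axiom can be checked numerically for the functional form of Eq. (1). The following is a sketch only, with $k_B = 1$; the helper names `shannon` and `grouped_entropy`, the example distribution, and the two groupings are hypothetical choices made for illustration.

```python
import math

def shannon(p):
    """Functional form of Eq. (1) with k_B = 1."""
    return -sum(x * math.log(x) for x in p if x > 0.0)

def grouped_entropy(p, groups):
    """Right-hand side of Eq. (2): S[q] + sum_I q_I S[{p_i / q_I}].

    `groups` is a list of lists of microstate indices, one list per macrostate.
    """
    q = [sum(p[i] for i in g) for g in groups]
    within = sum(q_I * shannon([p[i] / q_I for i in g])
                 for q_I, g in zip(q, groups))
    return shannon(q) + within

p = [0.1, 0.2, 0.3, 0.4]           # hypothetical microstate probabilities
grouping_a = [[0, 1], [2, 3]]      # two macrostates of two microstates each
grouping_b = [[0], [1, 2, 3]]      # unequal macrostate sizes
lhs = shannon(p)
print(lhs, grouped_entropy(p, grouping_a), grouped_entropy(p, grouping_b))
```

Both groupings return the same value as `shannon(p)`, which is the sense in which the total is independent of how the microstates are grouped.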

This equation is written in the standard form (see, for
example, Eq. (11.8) of Ref. 6, or the penultimate equation
of Appendix 2 of Ref. 1), but there is a subtle ambiguity
with it that ultimately causes the problem with
the information entropy. On the one hand, the notation
implies that the entropy expressed as a functional of the
microstate probability on the left hand side, $S[p]$, is the
same function of its argument as the entropy expressed
as a functional of the macrostate probability on the right
hand side, $S[q]$. On the other hand, the left hand side of
the equation implies that the microstate form $S[p]$ is the
total entropy of the system, whereas the right hand side
implies that the macrostate form $S[q]$ is only part of the
total entropy of the system. The expression for the
information entropy, Eq. (1), results from overlooking this
distinction, as is discussed below.

Following Shannon's Appendix 2 (Ref. 1), or Jaynes'
Eq. (11.10) (Ref. 6), assume that the microstates are equally
likely, $p_i^u = 1/n$, so that the left hand side of Eq. (2) is

$$S[p^u] = \sigma(n). \tag{3}$$

Further assume that the macrostates are all the same
size, $n_I = m$, so that $q_I = m/n$ and $p_{i|I} = 1/m$. Hence
$S[q^u] = \sigma(n/m)$ and $S[\{p_i/q_I\}] = \sigma(m)$. With these two
assumptions Eq. (2) becomes

$$\sigma(n) = \sigma(n/m) + \sum_I q_I \sigma(m) = \sigma(n/m) + \sigma(m). \tag{4}$$

This has evident solution

$$\sigma(n) = k_B \ln n. \tag{5}$$

Apart from the choice of Boltzmann's constant and the
logarithmic base, this solution is unique (Refs 1, 6). This result
may be seen to be consistent with Boltzmann's original 
definition of the entropy, namely that the entropy of a 
state is the logarithm of the number of molecular con- 
figurations in the state, assuming that the configurations 
are all equally likely. 

Now in order to deduce the form of the information
entropy as a functional of the probability distribution,
continue to take the microstates to be equally probable,
$p_i^u = 1/n$, but now take the macrostate distribution
to be non-uniform, $q_I \ne q_J$. This is achieved by grouping
different numbers of microstates $n_I$ into each macrostate
$I$. In this case rearranging Eq. (2) yields

$$\begin{aligned}
S[q] &= S[p^u] - \sum_I q_I S[\{p_i^u/q_I\}] \\
&= \sigma(n) - \sum_I q_I \sigma(n_I) \\
&= -k_B \sum_I q_I \ln q_I,
\end{aligned} \tag{6}$$

since $q_I = n_I/n$. With the apparently trivial replacement
$I \Rightarrow i$, this is the standard expression for the information
entropy, and so this appears to be a straightforward and
unambiguous derivation of Eq. (1).
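
The last step of this derivation can be verified numerically. The sketch below assumes $k_B = 1$ and a hypothetical set of macrostate sizes `sizes`; it compares the rearranged form $\sigma(n) - \sum_I q_I \sigma(n_I)$ with $-\sum_I q_I \ln q_I$.

```python
import math

def sigma(n):
    """Eq. (5) with k_B = 1: entropy of n equally likely states."""
    return math.log(n)

sizes = [1, 2, 5]                  # hypothetical n_I, microstates per macrostate
n = sum(sizes)
q = [n_I / n for n_I in sizes]     # q_I = n_I / n for uniform microstates

# Middle line of the derivation: sigma(n) - sum_I q_I sigma(n_I)
rearranged = sigma(n) - sum(q_I * sigma(n_I) for q_I, n_I in zip(q, sizes))

# Final line: -sum_I q_I ln q_I
shannon_form = -sum(q_I * math.log(q_I) for q_I in q)

print(rearranged, shannon_form)
```

The two numbers agree to rounding error, because $\ln n - \ln n_I = -\ln q_I$ term by term and the $q_I$ sum to one.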

However, as noted above, there is a conceptual problem
with Eq. (2), namely that the notation implies that
the microstate form $S[p]$ and the macrostate form $S[q]$
are the same functions of their arguments, whereas the
content of the equation implies that $S[p]$ is the total
entropy whereas $S[q]$ is only part of the total entropy. The
only way to resolve this discrepancy is to rewrite Eq. (2)
in terms of the macrostate probability with the total
entropy explicit,

$$S_{\rm total} = S[q] + \sum_I q_I S[\{p_i/q_I\}]. \tag{7}$$

The total entropy has to be independent of how the
microstates are grouped into macrostates (Axiom 4).
Hence this result can also be used when the macrostates
are taken to be microstates themselves (i.e. one
microstate in each macrostate), in which case this becomes

$$S_{\rm total} = S[p] + \sum_i p_i S[\{p_i/p_i\}] = S[p] + \sum_i p_i \sigma(1). \tag{8}$$

The simplest assumption is that there is no uncertainty
for a system consisting of a single microstate, $\sigma(1) = 0$.
(This assumption will be revisited in the next section.)
With these two equations, most of the preceding analysis
holds. Assuming a uniform distribution of both
microstates and macrostates one again finds that
$S_{\rm total} = \sigma(n) = k_B \ln n$. Assuming non-uniform macrostates,
$q_I \ne q_J$, and uniform microstates, $p_i^u = 1/n$, does not
change the total entropy, $S_{\rm total} = k_B \ln n$. In this case
Eq. (7) may be rearranged as

$$S[q] = k_B \ln n - \sum_I q_I \sigma(n_I) = -k_B \sum_I q_I \ln q_I. \tag{9}$$

This has the appearance of the standard information
entropy result, but now Eq. (7) shows explicitly that this
is only part of the entropy of the system. Inserting this
into Eq. (7) gives the total entropy as

$$S_{\rm total} = -k_B \sum_I q_I \ln q_I + \sum_I q_I S_I, \tag{10}$$

where in this case the internal entropy of the macrostate
for equally likely microstates is $S_I = S[\{p_i^u/q_I\}] = \sigma(n_I)$.
(This explicit result for the internal entropy of
the macrostate does not hold if the microstates are
non-uniformly distributed, but Eq. (10) does hold in this
general case, as is shown in the next section.)
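
To illustrate how the two terms of the total entropy trade off against each other, the sketch below evaluates them for several hypothetical groupings of eight equally likely microstates, with $k_B = 1$. The function name `entropy_split` and the groupings are assumptions made for illustration.

```python
import math

def entropy_split(sizes):
    """Return (macrostate term, internal term, total) for k_B = 1.

    `sizes` lists n_I, the number of equally likely microstates in each
    macrostate, so that q_I = n_I / n and S_I = ln n_I.
    """
    n = sum(sizes)
    q = [n_I / n for n_I in sizes]
    macro_term = -sum(q_I * math.log(q_I) for q_I in q)
    internal_term = sum(q_I * math.log(n_I) for q_I, n_I in zip(q, sizes))
    return macro_term, internal_term, macro_term + internal_term

for sizes in ([8], [4, 4], [1, 7], [1] * 8):
    macro, internal, total = entropy_split(sizes)
    print(sizes, round(macro, 4), round(internal, 4), round(total, 4))
```

In every grouping the total equals $\ln 8$, while the macrostate term on its own ranges from zero (a single macrostate) up to $\ln 8$ (one microstate per macrostate). Only in the latter case does the information entropy form alone give the total.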

This formula remains predicated on the assumption
that the underlying microstates are uniformly distributed.
A derivation will be given in the following section
without this restriction. Before then however two
comments can be made. First, if the microstates are in
reality uniformly distributed, $p_i^u = 1/n$, then it turns out
that the internal entropy of a microstate can indeed be
taken to be zero, $S_i = 0$, and this result for the total
entropy expressed in terms of microstates reduces to the
information entropy form, Eq. (1). In this case the result
is trivial,

$$S_{\rm total} = -k_B \sum_i p_i^u \ln p_i^u = k_B \ln n. \tag{11}$$

Because this result is trivial (and in essence just a state- 
ment of Boltzmann's original definition of the statistical 
mechanical entropy) there really is no point in formulat- 
ing the total entropy in the information entropy form for 
uniformly distributed microstates. 

Second, although both Eq. (9) and Eq. (10) are true
for non-uniformly distributed macrostates, the derivation
depends upon the microstates being uniformly distributed.
Additional assumptions or a different derivation
is required when the microstates are not uniformly
distributed. Similarly, there is nothing in the derivation
that says that either of these two equations can
be applied to non-uniformly distributed microstates (i.e.
$I \Rightarrow i$), which Shannon (Ref. 1) and Jaynes (Ref. 6) both assume in
their derivation of Eq. (1).



3. Derivation for Non-Uniform Microstates 

A simpler and more general (in the sense that it applies
for non-uniformly distributed microstates) derivation of
the entropy as a functional of the probability distribution
has been given by the author. Suppose that the
microstate $i$ has weight $w_i$. These weights arise from the
molecular details of the problem and they need not be
specified explicitly beyond the fact that they are
non-negative and linearly additive, which are required for the
system to satisfy the laws of probability.

Because the states are distinct, disjoint, and complete,
and because the weights are linearly additive, the weight
of a macrostate is $W_I = \sum_{i \in I} w_i$, and the total weight is

$$W = \sum_I W_I = \sum_i w_i. \tag{12}$$

The probability of a state is proportional to its weight,

$$p_i = \frac{w_i}{W}, \quad \text{and} \quad q_I = \frac{W_I}{W}. \tag{13}$$



These are obviously normalised. The total weight is 
called the partition function in statistical mechanics. 

Weight is the obvious generalisation of number in the
case that the states are not equally likely. Hence the
entropy of a state is defined to be Boltzmann's constant
times the logarithm of the weight of the state. The
microstate, macrostate, and total entropy are

$$S_i = k_B \ln w_i, \quad S_I = k_B \ln W_I, \quad \text{and} \quad S_{\rm total} = k_B \ln W, \tag{14}$$

respectively. It turns out that the entropy defined like 
this has all the properties that one would desire of the en- 
tropy in physical systems. The logarithm makes entropy 
an extensive variable, like energy, number, and volume, 
and this is what makes it so convenient for thermody- 
namics and statistical mechanics. 

In view of the relationship between entropy and weight, 
the probability of a state is proportional to the exponen- 
tial of the entropy divided by Boltzmann's constant, 



$$p_i = \frac{e^{S_i/k_B}}{W}, \quad \text{and} \quad q_I = \frac{e^{S_I/k_B}}{W}. \tag{15}$$



This generic relationship between entropy and probabil- 
ity is well known in statistical mechanics. 

Now to the major result, namely the information
entropy for the case of non-uniform probability distributions.
In view of the normalisation of the probability
distributions it is straightforward to verify that

$$\begin{aligned}
S_{\rm total} &= k_B \ln W \\
&= k_B \sum_i p_i \left[ \ln \frac{W}{w_i} + \ln w_i \right] \\
&= -k_B \sum_i p_i \ln p_i + \sum_i p_i S_i.
\end{aligned} \tag{16}$$

In the above formulation there was no fundamental dis- 
tinction between macrostates and microstates, and so the 
total entropy can equivalently be written as a functional 
of the macrostate probability, 



$$S_{\rm total} = -k_B \sum_I q_I \ln q_I + \sum_I q_I S_I. \tag{17}$$



These are in the form of Eq. (10) (for both microstates
and macrostates) but not in the form of the information
entropy expression, Eq. (1).
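
The microstate and macrostate forms of the total entropy can be compared numerically for non-uniform weights. This is a sketch under stated assumptions: $k_B = 1$, and the weights `w` and the grouping `groups` are hypothetical values chosen only for illustration.

```python
import math

w = [0.5, 1.5, 2.0, 4.0]           # hypothetical microstate weights w_i
groups = [[0, 1], [2, 3]]          # hypothetical grouping into macrostates
W = sum(w)                         # total weight, Eq. (12)

# Microstate form: -sum_i p_i ln p_i + sum_i p_i S_i, with S_i = ln w_i
p = [w_i / W for w_i in w]
S_micro = [math.log(w_i) for w_i in w]
total_micro = (-sum(p_i * math.log(p_i) for p_i in p)
               + sum(p_i * S_i for p_i, S_i in zip(p, S_micro)))

# Macrostate form: -sum_I q_I ln q_I + sum_I q_I S_I, with S_I = ln W_I
W_macro = [sum(w[i] for i in g) for g in groups]
q = [W_I / W for W_I in W_macro]
S_macro = [math.log(W_I) for W_I in W_macro]
total_macro = (-sum(q_I * math.log(q_I) for q_I in q)
               + sum(q_I * S_I for q_I, S_I in zip(q, S_macro)))

print(total_micro, total_macro, math.log(W))
```

All three printed values coincide with $\ln W$, whereas the first term of either form taken alone does not.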



4. Discussion

This obvious contradiction between the information
entropy result, Eq. (1), and the statistical mechanical
entropy result, Eq. (16) or Eq. (17), raises the question:
when is it valid to neglect the internal entropy of the
states?

For the case that the microstates are equally likely, one
can indeed set their weight to unity, $w_i = 1$, and their
entropy to zero, $S_i = 0$. But in this case Eq. (1) reduces
to the trivial result, $S_{\rm total} = k_B \ln n$, and there is no
circumstance when the formula in terms of the probability
distribution is required.

In the case that the microstates are not equally likely, 
one cannot neglect their internal entropy without ex- 
plicit justification. In statistical mechanics the generic 
cases where the microstates are not equally likely usu- 
ally involve hidden variables that are projected out of 
the problem. One example is where one describes the 
microstates of the system in terms of molecular coordinates
(e.g. the positions and momenta of the center of
mass of the molecules). This neglects finer levels of
description (e.g. the rotational coordinates, the bending
and stretching of, and the rotation about, intramolecular
bonds, the electron configuration, etc.). Depending
upon the specific system, there may be good reason for 
neglecting such finer levels of description, and there may 
also be good reason for neglecting the internal entropy 
due to them (e.g. the internal configurations and their 
weight might be the same for all the microstates, and so 
the internal entropy might be constant). 

The second example of the microstates having an
internal entropy is where the total system consists of a
sub-system and a reservoir, and the coordinates of the
sub-system form the microstates. For example, the canonical
equilibrium system is that of a sub-system in contact
with a thermal reservoir of temperature $T$. Here
the microstates are the points in the phase space of the
sub-system (the space of positions and momenta of the
atoms of the sub-system). However each sub-system
phase space point $\Gamma$ corresponds to multiple phase space
points of the reservoir, and the internal entropy associated
with each phase space point of the sub-system is
$S(\Gamma) = -\mathcal{H}(\Gamma)/T$, where $\mathcal{H}$ is the Hamiltonian or
total energy of the sub-system. This may be recognised as
the change in entropy of the reservoir. Neglecting this
microstate entropy by using the information theory
expression, Eq. (1), would give the wrong total entropy for
the canonical equilibrium system. With the Maxwell-Boltzmann
distribution, $p(\Gamma) = Z^{-1} \exp[-\mathcal{H}(\Gamma)/k_B T]$,
the information entropy form gives the total entropy as

$$S_{\rm total} = -k_B \int d\Gamma \, p(\Gamma) \ln p(\Gamma) = k_B \ln Z + \frac{1}{T} \langle \mathcal{H}(\Gamma) \rangle, \tag{18}$$

whereas the full expression gives

$$S_{\rm total} = \int d\Gamma \, p(\Gamma) \left[ -k_B \ln p(\Gamma) + S(\Gamma) \right] = k_B \ln Z. \tag{19}$$

Clearly, only the full expression agrees with the well- 
known result that the logarithm of the partition function 
gives the total entropy of the system. The information 
entropy expression gives the entropy of the sub-system 
alone, neglecting the reservoir entropy. Obviously, in 
seeking, for example, to optimise a system, one should 
maximise the total entropy, not just part of it. 
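
The canonical comparison can be reproduced for a discrete analogue in which the phase space integral is replaced by a sum over a few sub-system states. This is only a sketch: the discrete energy levels, the temperature, the choice $k_B = 1$, and the function name `canonical_entropies` are assumptions introduced for illustration.

```python
import math

def canonical_entropies(energies, T, k_B=1.0):
    """Compare the information entropy form with the full expression.

    `energies` are hypothetical sub-system energies; each state carries the
    reservoir (internal) entropy S = -E / T.
    """
    boltzmann = [math.exp(-E / (k_B * T)) for E in energies]
    Z = sum(boltzmann)                                   # partition function
    p = [b / Z for b in boltzmann]                       # Maxwell-Boltzmann
    mean_H = sum(p_i * E for p_i, E in zip(p, energies))
    info = -k_B * sum(p_i * math.log(p_i) for p_i in p)  # sub-system only
    reservoir = sum(p_i * (-E / T) for p_i, E in zip(p, energies))
    full = info + reservoir                              # total entropy
    return {"Z": Z, "mean_H": mean_H, "info": info, "full": full}

result = canonical_entropies(energies=[0.0, 1.0, 2.5], T=1.5)
print(result["info"], math.log(result["Z"]) + result["mean_H"] / 1.5)
print(result["full"], math.log(result["Z"]))
```

The first printed pair shows the information entropy form exceeding $k_B \ln Z$ by $\langle \mathcal{H} \rangle / T$; the second pair shows the full expression reproducing $k_B \ln Z$.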

From this example, one can see explicitly why the
information entropy expression, Eq. (1), is inappropriate
for statistical mechanics. The question remains: in what
sense is it appropriate for information theory?

The information entropy singles out a particular
representation as the preferred representation, namely the one
in which $\sum_i p_i S_i = 0$. Since only differences in entropy
are significant, it is always possible to add a constant to
the entropy,

$$S'_i = S_i + c, \quad S'_I = S_I + c, \quad \text{and} \quad S'_{\rm total} = S_{\rm total} + c. \tag{20}$$

Choosing $c = -\sum_i p_i S_i$ (either explicitly or implicitly),
one sees that

$$\begin{aligned}
S'_{\rm total} &= -k_B \sum_i p_i \ln p_i \\
&= -k_B \sum_I q_I \ln q_I + \sum_I q_I S'_I.
\end{aligned} \tag{21}$$

The first equality is in the form of the information
entropy, but the second equality shows that any shift to
another set of states requires the full expression for the
total entropy.
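
The effect of the constant shift can be checked numerically. The sketch below takes $k_B = 1$ and uses hypothetical weights `w` and grouping `groups`; it assumes that the macrostate entropies are shifted by the same constant $c$ as the microstate entropies.

```python
import math

w = [0.5, 1.5, 2.0, 4.0]           # hypothetical microstate weights
groups = [[0, 1], [2, 3]]          # hypothetical grouping into macrostates
W = sum(w)
p = [w_i / W for w_i in w]
S_micro = [math.log(w_i) for w_i in w]

# Shift chosen so that the shifted microstate entropies average to zero
c = -sum(p_i * S_i for p_i, S_i in zip(p, S_micro))
shifted_total = math.log(W) + c
shannon_micro = -sum(p_i * math.log(p_i) for p_i in p)

# Macrostate description with the same shift applied to S_I = ln W_I
W_macro = [sum(w[i] for i in g) for g in groups]
q = [W_I / W for W_I in W_macro]
shannon_macro = -sum(q_I * math.log(q_I) for q_I in q)
shifted_S_macro = [math.log(W_I) + c for W_I in W_macro]
full_macro = shannon_macro + sum(q_I * S_I
                                 for q_I, S_I in zip(q, shifted_S_macro))

print(shifted_total, shannon_micro, full_macro, shannon_macro)
```

The first three printed values agree, while the macrostate information entropy `shannon_macro` on its own is smaller: after the shift the internal term vanishes for the microstates but not for the macrostates.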

In communications and informatic problems, one gen- 
erally does not have an underlying molecular description 
of the message, signal, image, etc., and so one cannot give 
the actual value of the weight or the entropy of a state. 
However, one can measure the probability of a state with 
relative ease. The apparent advantage of Eq. (1) and of
the first equality here is that it depends only on the prob- 
ability distribution, not explicitly upon the entropy of the 
states. Further, it is often the case that the microstates 
are chosen based on the individual characters in a mes- 
sage (or the pixels in an image), and these are indivisible 
and arguably have no internal rearrangement that can 
contribute to the information content of the message. In 
this sense one can make a strong argument for applying 
Eq. (1) in informatic applications provided that one
explicitly restricts its use to such microstates. One should
not use that particular equation in statistical mechanics
because it risks confusion between microstates and
macrostates, and also because it is not valid for the
microstates that typically appear in statistical mechanics.